Automatic ttrpc and bridge span creation #1379
Conversation
Force-pushed from 4a510c9 to 01b6418
Force-pushed from 01b6418 to 668fd6e
Force-pushed from 74fd8aa to aadc6de
In general I like it! The one comment that I have is that I never loved the default name for gRPC entry spans, and in our case here we are going to do
Do you mean
Yeah ... we are planning to clean up the bridge protocol code (eventually). But it is kludgy to say the least. Unfortunately, the spans on the host will have different names than spans on the Linux guest, which resolve from hcsshim/internal/guest/prot/protocol.go, lines 150 to 206 in 25b6734.
Force-pushed from aadc6de to 98bdbbc
Force-pushed from 98bdbbc to 593c1e7
Rebased to resolve merge conflicts
@@ -248,8 +228,9 @@ func (c *Container) Wait() error {
 }

 func (c *Container) waitBackground() {
-	ctx, span := oc.StartSpan(context.Background(), "gcs::Container::waitBackground")
+	_, span := oc.StartSpan(context.Background(), "gcs::Container::waitBackground")
why are we keeping this span but not the others on a Container?
all the other Container operations eventually end up with a gcs and then bridge call, so they have spans being created for them.
This is the only one that doesn't. Plus this is a long-running call that is launched asynchronously, so we don't have any other way of tracking it, which I felt made it worthy of a dedicated span.
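For context, a minimal sketch of the pattern under discussion (illustrative only; it calls the opencensus trace package directly rather than hcsshim's internal oc wrapper, and the Container type is a stand-in): a call launched on its own goroutine has no caller span to parent to, so it starts a dedicated root span from context.Background().

```go
package main

import (
	"context"

	"go.opencensus.io/trace"
)

type Container struct{ exited chan struct{} }

// waitBackground runs on its own goroutine, so no caller context (and hence
// no parent span) reaches it; it starts a fresh root span to stay traceable.
func (c *Container) waitBackground() {
	_, span := trace.StartSpan(context.Background(), "gcs::Container::waitBackground")
	defer span.End()

	<-c.exited // block until the container exits
}

func main() {
	c := &Container{exited: make(chan struct{})}
	go c.waitBackground() // launched asynchronously; the span is the only way to track it
	close(c.exited)
}
```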
var err error
_, span := oc.StartSpan(context.Background(), "gcs::Process::waitBackground")
same thing here. Why are we starting a span here?
internal/guest/bridge/bridge_v2.go (Outdated)
ctx := r.Context
log.G(ctx).Trace("opengcs::bridge::modifySettingsV2")
Same question here. Should this and dumpStacksV2 and deleteContainerStateV2 use logEntry?
dumpStacksV2 doesn't apply to a particular container, right? But yes, deleteContainerStateV2 should.
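For readers following along, a hedged sketch of what routing these handlers through a logEntry-style helper might look like — the helper shape and the requestBase field names here are assumptions for illustration, not the actual bridge_v2.go code:

```go
package main

import (
	"context"

	"github.com/sirupsen/logrus"
)

// requestBase stands in for the bridge request fields the helper would draw
// on; the exact fields are assumptions for this sketch.
type requestBase struct {
	ContainerID string
	ActivityID  string
}

// logEntry enriches the log entry with per-request identifiers so handlers
// like deleteContainerStateV2 emit container-scoped trace lines.
func logEntry(ctx context.Context, r *requestBase) *logrus.Entry {
	return logrus.WithContext(ctx).WithFields(logrus.Fields{
		"cid":         r.ContainerID,
		"activity-id": r.ActivityID,
	})
}

func main() {
	logrus.SetLevel(logrus.TraceLevel)
	r := &requestBase{ContainerID: "c1", ActivityID: "a1"}
	logEntry(context.Background(), r).Trace("opengcs::bridge::deleteContainerStateV2")
}
```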
internal/log/hook.go (Outdated)

// DurationFormat converts [time.Duration] fields to an appropriate encoding.
//
// Default is [DurationFormatSeconds].
should this be milliseconds?
As in, using milliseconds is a better representation? I figured since some of the OCI requests are on the order of seconds, seconds was appropriate, but I can swap in milliseconds if that makes more sense.
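To make the trade-off concrete, a small standalone sketch — the DurationFormat type and constants are modeled on the diff, not copied from it: seconds encode as a float and keep sub-second precision, while milliseconds encode as an integer.

```go
package main

import (
	"fmt"
	"time"
)

type DurationFormat int

const (
	DurationFormatSeconds DurationFormat = iota
	DurationFormatMilliseconds
)

// encodeDuration converts a time.Duration log field per the chosen format.
func encodeDuration(d time.Duration, f DurationFormat) interface{} {
	switch f {
	case DurationFormatMilliseconds:
		return d.Milliseconds() // int64 ms; truncates sub-millisecond detail
	default:
		return d.Seconds() // float64 s; suits OCI requests on the order of seconds
	}
}

func main() {
	d := 1500 * time.Millisecond
	fmt.Println(encodeDuration(d, DurationFormatSeconds))      // 1.5
	fmt.Println(encodeDuration(d, DurationFormatMilliseconds)) // 1500
}
```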
if s.isSandbox {
	span.AddAttributes(trace.StringAttribute("pod-id", s.tid))
}
s.logEntry(ctx).Trace("State")
so we're no longer logging request information?
We do; we now add the entire request to the auto-generated span, instead of manually adding bits of the request to the individual spans we create here:
hcsshim/pkg/octtrpc/interceptor.go, lines 122 to 124 in 593c1e7:

if err = unmarshal(req); err == nil {
	span.AddAttributes(trace.StringAttribute("request", log.Format(ctx, req)))
}
Force-pushed from 99417e1 to 081c6d9
Force-pushed from 081c6d9 to 66588a1
rebased
Force-pushed from 66588a1 to 22fee56
pkg/octtrpc/interceptor.go (Outdated)
 	defer func() {
 		if err == nil {
 			span.AddAttributes(trace.StringAttribute("response", log.Format(ctx, resp)))
 		}
 		setSpanStatus(span, err)
 	}()
 	if len(o.attrs) > 0 {
 		span.AddAttributes(o.attrs...)
 	}

-	return method(ctx, unmarshal)
+	return method(ctx, func(req interface{}) (err error) {
+		if err = unmarshal(req); err == nil {
+			span.AddAttributes(trace.StringAttribute("request", log.Format(ctx, req)))
+		}
+		return err
+	})
I'm concerned about always logging the request/response:
- These could be quite large (compared to a normal span) in many cases, which would bloat logging data or lead to worse performance.
- These could contain sensitive information that should not be logged.
I'll validate/update the scrubbing for the requests that contain specs, but we already log most pieces of the requests/responses in the spans today; we just add the fields manually:
hcsshim/cmd/containerd-shim-runhcs-v1/service.go, lines 126 to 150 in fa3b77d:

func (s *service) Create(ctx context.Context, req *task.CreateTaskRequest) (resp *task.CreateTaskResponse, err error) {
	ctx, span := oc.StartSpan(ctx, "Create")
	defer span.End()
	defer func() {
		if resp != nil {
			span.AddAttributes(trace.Int64Attribute("pid", int64(resp.Pid)))
		}
		oc.SetSpanStatus(span, err)
	}()

	span.AddAttributes(
		trace.StringAttribute("tid", req.ID),
		trace.StringAttribute("bundle", req.Bundle),
		// trace.StringAttribute("rootfs", req.Rootfs), TODO: JTERRY75 -
		// OpenCensus doesnt support slice like our logrus hook
		trace.BoolAttribute("terminal", req.Terminal),
		trace.StringAttribute("stdin", req.Stdin),
		trace.StringAttribute("stdout", req.Stdout),
		trace.StringAttribute("stderr", req.Stderr),
		trace.StringAttribute("checkpoint", req.Checkpoint),
		trace.StringAttribute("parentcheckpoint", req.ParentCheckpoint))

	if s.isSandbox {
		span.AddAttributes(trace.StringAttribute("pod-id", s.tid))
	}
The only exception is the Exec process request, which carries the OCI spec, and we log that later on regardless.
Also, we log all requests/responses over the bridge (and log them again on the Linux GCS side):
hcsshim/internal/gcs/bridge.go, lines 399 to 402 in fa3b77d:

brdg.log.WithFields(logrus.Fields{
	"payload":    string(b),
	"type":       typ.String(),
	"message-id": id}).Debug("bridge send")

hcsshim/internal/guest/bridge/bridge.go, line 322 in fa3b77d:

entry.WithField("message", s).Debug("request read message")

hcsshim/internal/gcs/bridge.go, lines 298 to 301 in fa3b77d:

brdg.log.WithFields(logrus.Fields{
	"payload":    string(b),
	"type":       typ.String(),
	"message-id": id}).Debug("bridge receive")

hcsshim/internal/guest/bridge/bridge.go, line 386 in fa3b77d:

log.G(resp.ctx).WithField("message", string(responseBytes)).Debug("request write response")
Performance-wise, the formatting uses google.golang.org/protobuf/encoding/protojson, which should be quite performant:
hcsshim/internal/log/format.go, lines 71 to 79 in fa3b77d:

if m, ok := v.(proto.Message); ok {
	// use canonical JSON encoding for protobufs (instead of [encoding/json])
	// https://protobuf.dev/programming-guides/proto3/#json
	return protojson.MarshalOptions{
		AllowPartial: true,
		// protobuf defaults to camel case for JSON encoding; use proto field name instead (snake case)
		UseProtoNames: true,
	}.Marshal(m)
}
Force-pushed from 22fee56 to 88090d0
The ttrpc interceptor now automatically formats (as JSON) and adds server requests and responses as span attributes. The host and guest bridges automatically create spans for RPCs. Replaced redundant `cmd/containerd-shim-runhcs-v1/service.go.(Service)`, `internal/gcs`, and `internal/guest/bridge` spans with `.Trace` log entries, since they include the same (or less) information as those automatically generated by the ttrpc interceptor and `(*bridge).RPC`. Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>

Adding the request/response to the ttrpc span as attributes is now opt-in, since there are other consumers of our code. Allow scrubbing the `proto.Message` request and response messages in the ttrpc server interceptor by specifying an arbitrary function to update a clone of the payloads. Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
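A hedged sketch of how such a scrub hook could be wired into an interceptor option — the WithScrubber name and options shape are assumptions for illustration, not the actual octtrpc API (sourcecontextpb again stands in for a real request type):

```go
package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/sourcecontextpb"
)

type options struct {
	scrub func(proto.Message) // redacts fields on a clone before logging
}

type Option func(*options)

// WithScrubber registers a function applied to a *clone* of each
// request/response, so the live message handed to the handler is untouched.
func WithScrubber(f func(proto.Message)) Option {
	return func(o *options) { o.scrub = f }
}

// scrubbed returns the message safe to format into a span attribute.
func (o *options) scrubbed(m proto.Message) proto.Message {
	if o.scrub == nil {
		return m
	}
	c := proto.Clone(m)
	o.scrub(c)
	return c
}

func main() {
	o := &options{}
	WithScrubber(func(m proto.Message) {
		if sc, ok := m.(*sourcecontextpb.SourceContext); ok {
			sc.FileName = "<scrubbed>" // illustrative redaction
		}
	})(o)

	req := &sourcecontextpb.SourceContext{FileName: "secret.proto"}
	fmt.Println(o.scrubbed(req)) // the clone is redacted...
	fmt.Println(req)             // ...the original is unchanged
}
```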
Force-pushed from 88090d0 to d455f6d
LGTM
(Host) Bridge automatically creates spans for RPCs.

Update ttrpc interceptor to add request and response structs to span attributes.
Update gcs bridge to create spans for outgoing calls automatically.

Replaced redundant `cmd/containerd-shim-runhcs-v1.(*Service)` and `internal/gcs` spans with `.Trace` log entries, since they include all the same (or less) information as those automatically generated by the ttrpc interceptor and `(*bridge).RPC`.

Relies on PRs:
- rpcProc string conversion #1391

This PR gets rid of the `cmd/containerd-shim-runhcs-v1.(*Service)` spans (eg, "State", "Create") and instead only exports the TTRPC spans (eg, "containerd.task.v2.Task.State", "containerd.task.v2.Task.Create").